Leave-One-Out Phrase Model Training for Large-Scale Deployment
Authors
Abstract
Training the phrase table by force-aligning (FA) the training data with the reference translation has been shown to improve phrasal translation quality while significantly reducing the phrase table size on medium-sized tasks. We apply this procedure to several large-scale tasks, with the primary goal of reducing model sizes without sacrificing translation quality. To deal with the noise in the automatically crawled parallel training data, we introduce on-demand word deletions, insertions, and backoffs to achieve a successful alignment rate of over 99%. We also add heuristics to avoid any increase in out-of-vocabulary (OOV) rates. We are able to reduce already heavily pruned baseline phrase tables by more than 50% with little to no degradation in quality, and occasionally a slight improvement, without any increase in OOVs. We further introduce two global scaling factors for re-estimating the phrase table via posterior phrase alignment probabilities, and a modified absolute discounting method that can be applied to fractional counts.
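The abstract does not give the formula for the modified absolute discounting, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes one simple way to handle fractional counts: subtract a fixed discount from each count, clip at zero, and keep the freed probability mass per source phrase for a backoff model. The function name `discount_fractional_counts` and the value of `discount` are hypothetical.

```python
from collections import defaultdict


def discount_fractional_counts(counts, discount=0.5):
    """Absolute discounting over fractional phrase-pair counts.

    counts maps (source_phrase, target_phrase) to a non-negative float.
    Assumption (not from the paper): counts smaller than the discount
    are clipped to zero, and such entries are left out of the table.
    Returns (probs, backoff_mass), where probs maps phrase pairs to
    p(target | source) and backoff_mass maps each source phrase to the
    probability mass freed by discounting.
    """
    # Total fractional count per source phrase (the normalizer).
    totals = defaultdict(float)
    for (src, _tgt), count in counts.items():
        totals[src] += count

    probs = {}
    kept_mass = defaultdict(float)
    for (src, tgt), count in counts.items():
        reduced = max(count - discount, 0.0)
        if reduced > 0.0:
            p = reduced / totals[src]
            probs[(src, tgt)] = p
            kept_mass[src] += p

    # Whatever mass was not kept is available for a backoff distribution.
    backoff_mass = {src: 1.0 - kept_mass[src] for src in totals}
    return probs, backoff_mass


if __name__ == "__main__":
    example = {("haus", "house"): 2.5, ("haus", "home"): 0.3, ("auto", "car"): 1.0}
    table, backoff = discount_fractional_counts(example, discount=0.5)
    print(table)
    print(backoff)
```

In this sketch a pair whose fractional count falls below the discount disappears from the table, which shrinks it; whether the paper's modification behaves this way is not stated in the abstract.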
Similar resources
Access control in ultra-large-scale systems using a data-centric middleware
The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows and the demand for interoperability between sub-systems increases, achieving a more scalable and dynamic access control system becomes an im...
Fast exact leave-one-out cross-validation of sparse least-squares support vector machines
Leave-one-out cross-validation has been shown to give an almost unbiased estimator of the generalisation properties of statistical models, and therefore provides a sensible criterion for model selection and comparison. In this paper we show that exact leave-one-out cross-validation of sparse Least-Squares Support Vector Machines (LS-SVMs) can be implemented with a computational complexity of on...
How Does Large-scale Wind Power Generation Affect Energy and Reserve Prices?
The intermittent nature of wind power has confronted ISOs and power producers with new challenges. Wind power uncertainty has increased the required reserve capacity and deployment reserve. Consequently, large-scale wind power generation increases ISO costs and, in turn, reserve prices. On the other hand, since wind power producers are price takers, large-scale wind power generation decreases residual dema...
Withdrawing an example from the training set: An analytic estimation of its effect on a non-linear parameterised model
For a non-linear parameterised model, the effects of withdrawing an example from the training set can be predicted. We focus on the prediction of the error on the left-out example, and of the confidence interval for the prediction of this example. We derive a rigorous expression of the first-order expansion, in parameter space, of the gradient of a quadratic cost function, and specify its valid...
Large-scale Reordering Model for Statistical Machine Translation using Dual Multinomial Logistic Regression
Phrase reordering is a challenge for statistical machine translation systems. Posing phrase movements as a prediction problem, using contextual features modeled by a maximum entropy-based classifier, is superior to the commonly used lexicalized reordering model. However, training this discriminative model on a large-scale parallel corpus might be computationally expensive. In this paper, we explor...